A simple sketching algorithm for entropy estimation

Authors

  • Peter Clifford
  • Ioana Ada Cosma
Abstract

We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream when space limitations make exact computation infeasible. It is known that α-dependent quantities such as the Rényi and Tsallis entropies can be estimated efficiently and unbiasedly from low-dimensional α-stable data sketches. An approximation to the Shannon entropy can be obtained from either of these quantities by taking α sufficiently close to 1. However, practical guidelines for the choice of α are lacking. We avoid this problem by going directly to the limit. We show that the projection variables used in estimating the Rényi entropy can be transformed to have a proper distributional limit as α approaches 1. The Shannon entropy can then be estimated directly from a data sketch based on this limiting distribution. We derive properties of the distribution, showing that it has a surprisingly simple characteristic function (iθ)^{iθ} and that the kth moment of the exponential of such a variable is k^k for all non-negative real values of k. These properties enable the Shannon entropy to be estimated directly from the associated data sketch as the logarithm of a simple average. We obtain the Fisher information for the statistical problem of recovering the entropy from the data sketch and hence a lower bound on the standard error of the estimated entropy. We show that our proposed estimator has theoretical statistical efficiency of 96.8% and confirm this with an empirical study. Finally, we demonstrate that in order for the estimator to have 1 + ε coverage with high probability the sketch must have size O(1/ε^2), in agreement with theoretical bounds.
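
The "logarithm of a simple average" idea can be illustrated with a small Python sketch. It is not the authors' implementation. It rests on one identity that follows from the moment property stated in the abstract: if X_1, ..., X_n are independent with E[exp(kX)] = k^k and p_i are the relative frequencies, then E[exp(Σ p_i X_i)] = Π p_i^{p_i} = exp(−H), so H can be estimated as minus the log of the average of exp(y_j) over sketch entries y_j. The function names are hypothetical, the variates are drawn with the Chambers-Mallows-Stuck formula for a maximally skewed stable law with α = 1 (an assumed choice of sampler), and the projection matrix is stored densely for clarity, whereas a real streaming implementation would regenerate its entries pseudo-randomly.

```python
import numpy as np


def sample_skewed_stable(rng, size):
    """Draw variates with characteristic function (i*theta)^(i*theta).

    Assumption: Chambers-Mallows-Stuck formula for alpha = 1, beta = -1,
    scale pi/2, simplified algebraically; not taken from the paper.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    w = rng.exponential(1.0, size)                # unit exponential
    v = np.pi / 2 - u
    return v * np.tan(u) + np.log(w * np.cos(u) / v)


def entropy_sketch(counts, k, rng):
    """Return k sketch entries y_j = sum_i p_i * X_ij for frequency counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()                     # relative frequencies
    x = sample_skewed_stable(rng, (k, p.size))    # dense projection matrix
    return x @ p


def estimate_entropy(sketch):
    """Estimate Shannon entropy (in nats) as minus the log of a simple average."""
    return -np.log(np.mean(np.exp(sketch)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts = np.arange(1, 51)                     # toy frequency vector
    p = counts / counts.sum()
    exact = -np.sum(p * np.log(p))
    approx = estimate_entropy(entropy_sketch(counts, 20000, rng))
    print(f"exact entropy     = {exact:.4f} nats")
    print(f"sketched estimate = {approx:.4f} nats")
```

Because each sketch entry is linear in the counts, the unnormalised sums Σ a_i X_ij could be updated one stream item at a time and divided by the stream length only when an estimate is requested.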

Similar resources

A simple sketching algorithm for entropy estimation over streaming data

We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream under the relaxed strict-turnstile model, when space limitations make exact computation infeasible. An equivalent measure of entropy is the Rényi entropy that depends on a constant α. This quantity can be estimated efficiently and unbiasedly from a low-dimensional synopsis called an α-stable da...

Sketching and Streaming High-Dimensional Vectors

A sketch of a dataset is a small-space data structure supporting some prespecified set of queries (and possibly updates) while consuming space substantially sublinear in the space required to actually store all the data. Furthermore, it is often desirable, or required by the application, that the sketch itself be computable by a small-space algorithm given just one pass over the data, a so-call...

ADAPTIVE NEURO FUZZY INFERENCE SYSTEM BASED ON FUZZY C–MEANS CLUSTERING ALGORITHM, A TECHNIQUE FOR ESTIMATION OF TBM PENETRATION RATE

The tunnel boring machine (TBM) penetration rate estimation is one of the crucial and complex tasks encountered frequently to excavate the mechanical tunnels. Estimating the machine penetration rate may reduce the risks related to high capital costs typical for excavation operation. Thus establishing a relationship between rock properties and TBM pe...

Sketching for Nearfield Acoustic Imaging of Heavy-Tailed Sources

We propose a probabilistic model for acoustic source localization with known but arbitrary geometry of the microphone array. The approach has several features. First, it relies on a simple nearfield acoustic model for wave propagation. Second, it does not require the number of active sources. On the contrary, it produces a heat map representing the energy of a large set of candidate locations, ...

An Advanced State Estimation Method Using Virtual Meters

Power system state estimation is a central component in energy management systems of power system. The goal of state estimation is to determine the system status and power flow of transmission lines. This paper presents an advanced state estimation algorithm based on weighted least square (WLS) criteria by introducing virtual meters. For each bus of network, except slack bus, a virtual meter...

Publication date: 2009